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Abstract 

We consider multivariate two-sample tests of means, where the location shift 
between the two populations is expected to be related to a known graph structure. 
An important application of such tests is the detection of differentially expressed 
genes between two patient populations, as shifts in expression levels are expected 
to be coherent with the structure of graphs reflecting gene properties such as 
biological process, molecular function, regulation, or metabolism. For a fixed 
graph of interest, we demonstrate that accounting for graph structure can yield 
more powerful tests under the assumption of smooth distribution shift on the 
graph. We also investigate the identification of non-homogeneous subgraphs 
of a given large graph, which poses both computational and multiple testing 
problems. The relevance and benefits of the proposed approach are illustrated 
on synthetic data and on breast cancer gene expression data analyzed in context 
of KEGG pathways. 
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1 Introduction 



The problem of testing whether two data generating distributions are equal has been 
studied extensively in the statistical and machine learning literatures. Practical appli- 
cations range from speech recognition to fMRI and genomic data analysis. Parametric 
approaches typically test for divergence between two distributions using statistics based 
on a standardized difference of the two sample means, e.g., Stu dent's ^-statistic in the 
univa riate case or Hotelling's T 2 -statistic in the multivariate case Lehmann and Romanol . 
2005]. A variety of non-param etric rank-based tests hay e also been proposed. More re- 



cently, lHarchaoui et al.l [20071 ] and iGretton et all |2007] devised kernel-based statistics 
for homogeneity tests in a function space. 

In several settings of interest, prior information on the structure of the distri- 
bution shift is available as a graph on the variables. Specifically, suppose we ob- 
serve {X}, . . . , X^} G M p from a first multivariate normal distribution N{ni, S) and 
{Xf,...,X% 2 } G MP from a second such distribution A/"(//2,£). In cases where an 
undirected graph Q = (V, £) encoding some type of covariance information in W is 
given, the putative location or mean shift 8 = /ii — ^ ma Y be expected to be coherent 
with Q. That is, 5 viewed as a function of Q is smooth, in the sense that the shifts 
Si and Sj for two connected nodes Vi and Vj G V are similar. Classical tests, such as 
Hotelling's T 2 -test, consider the null hypothesis Ho : [i\ — [ii against the alternative 
Hx : \i\ ^ /^2, without reference to the graph. Our goal is to take into account the 
graph structure of the variables in order to build a more powerful two-sample test of 
means under smooth-shift alternatives. 

Just as a natural notion of smoothness of functions on a Euclidean space can be 
defined throu gh the notion of Dirichl et energy and cont rolled by Four ier decomposition 



and filtering [Stain and Weissl . Il97l| . it is well-known [Chungi . 11997] that the smooth- 
ness of functions on a graph can be naturally defined and controlled through spectral 
analysis of the graph Laplacian. In particular, the eigenvectors of the Laplacian provide 
a basis of functions which vary on the graph at increasing frequencies (corresponding 
to the increasing eigenvalues). In this paper, we propose to compare two populations 
in terms of the first few components of the graph-Fourier basis or, equivalently, in the 
original space, after filtering out high-frequency components. 

An important motivation for the development of our graph-structured test is the 
detection of groups of genes whose expression changes between two conditions. For 
example, identifying groups of genes that are differentially expressed (DE) between 
patients for which a particular treatment is effective and patients which are resistant to 
the treatment may give insight into the resistance mechanism and even suggest targets 
for new drugs. In such a context, expression data from high-throughput microarray and 
sequencing assays gain much in relevance from their association with graph-structured 
prior information on the genes, e.g., Gene Ontology (GO; http://www.geneontology.org) 
or Kyoto Encyclopedia of Genes and Genomes (KEGG; |http : //www . genome . jp/keggP . 
Most approaches to the joint analysis of gene expression data and gene graph data 
involve two distinct steps. Firstly, tests of differential expression are performed sepa- 
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rately for each gene. Then, these univariate (gene-level) testing results are extended 
to the level of gene sets, e.g., by assessing the over-representation of DE genes in 
each set based on p- values for Fisher's exact test ( or a y 2 approximation thereof) ad- 
justed for multiple testing Beissbarth and Speed! 2004 1 or based on permutation ad- 



juste d p-values for weighted Kolmogorov-Smirnov-like statistics [Subramanian et al. 



20051 ] . Another family of methods directly performs multiv ariate tests of d ifferen- 
tial ex pression for groups of genes, e .g., Hotelling's T 2 -test |Lu et al.l . 120051 ] . It is 
known Goeman and Biihlmannl . 120071 ] that the former family of approaches can lead 
to incorrect interpretations, as the sampling units for the tests in the second step be- 
come the genes (as opposed to the patients) and these are expected to have strongly 
correlated expression measures. This suggests that direct multivariate testing of gene 
set differential expression is more appropriate than posterior aggregation of individ- 
ual gene-level tests. On the other hand, while Hotelling's T 2 -statistic is known to 
perfo rm well in small dimension s, it loses power very quickly with increasing dimen- 

essentially because it is based on the inverse of the 



Bai and Saranadasal . 11996 



sion 

empirical covariance matrix which becomes ill-conditioned. In addition, such direct 
multivariate tests on unstructured gene sets do not take advantage of information on 
gene regulation or other relevant biological properties. An increasing number of regu- 
lation networks are becoming available, specifying, for example, which genes activate 
or inhibit the expression of which other genes. As stated before, incorporating such 
biological knowledge in DE tests is important. Indeed, if it is known that a particular 
gene in a tested gene set activates the expression of another, then one expects the 
two genes to have coherent (differential) expression patterns, e.g., higher expression of 
the first gene in resistant patients should be accompanied by higher expression of the 
second gene in these patients. Accordingly, the first main contribution of this paper 
is to propose and validate multivariate test statistics for identifying distribution shifts 
that are coherent with a given graph structure. 

Next, given a large graph and observations from two data generating distributions 
on the graph, a more general problem is the identification of smaller non-homogeneous 
subgraphs, i.e., subgraphs on which the two distributions (restricted to these sub- 
graphs) are significantly different. This is very relevant in the context of tests for gene 
set differential expression: given a large set of genes, together with their known regula- 
tion network, or the concatenation of several such overlapping sets, it is important to 
discover novel gene sets whose expression change significantly between two conditions. 
Currently-available gene sets have often been defined in terms of other phenomena 
than that under study and physicians may be interested in discovering sets of genes 
affecting in a concerted manner a specific phenotype. Our second main contribution is 
therefore to develop algorithms that allow the exhaustive testing of all the subgraphs of 
a large graph, while accounting for the multiplicity issue arising from the vast number 
of subgraphs. 

As the problem of identifying variables or groups of variables which differ in distri- 
bution between two populations is closely-relate d to supervised learnin g, our proposed 
approach is similar to several learning methods. iRapaport et al.l 20071 ] use filtering in 
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the Fourier space of a graph to train linear classifiers of gene expression profiles whose 
weights are smooth on a gene network. However, their classifier enforces global smooth- 
ness on the large regularization network of all the genes, whereas we are concerned with 
th e selection o f gene sets with locally-smooth expression shift between populations. 



In I Jacob et al.l [20091 ] . sparse learning methods are used to build a classifier based on 
a small number of gene sets. While this approach leads in practice to the selection 
of groups of variables whose distributions differ between the two classes, the objective 
is to achieve the best classification performance with the smallest possible number of 
groups. As a result, correlated groups of variables are typically not selected. Other re- 



lated work includes iFan and Linl [1998] , who proposed an adaptive Neyman test in the 
Fourier space for time-series. However, as illustrated below in Section [5], direct transla- 
tion of the adaptive Neyman statistic to the graph case is problematic, as assumptions 
on Fourier coefficients which are true for time-series do not hold for graphs. In addition, 
the Neyman statistic converges very slowly towards its asymptotic distribution and the 
required calibration by bootstrapping renders its application to our subgraph discovery 
context difficult. By contrast, other methods do not account for shift smoothness and 
try to address the loss of power caused by the poor conditioning o f the T 2 -statistic by 
applying it after dimensionality reduction [Ma and Kosorokl. 12009 1 or by omitting; the 



inverse co variance ma t rix and adjusting in stead by its trace [Bai and Saranadasal . 11996 



Chen and Qinl . [2010(. IVaske et al.l [20101 ] recently proposed DE tests, where a proba- 



bilistic graphical model is built from a gene network. However, this model is used for 
gene-level DE tests, which then have to be combined to test at the leve l of gene sets. 



Several approaches for subgraph discovery, like that of lldeker et al.l [20021 ]. are based on 



a heuristic to identify the most differentially expressed subgraphs and do not amount 
to testing exactly all the subgr aphs. Concerning the discovery of distribution-shifted 



subgraphs, IVandin et al.l [2010] propose a graph Laplacian-based testing procedure to 



identify groups of interacting proteins whose genes contain a large number of mu- 
tations. Their approach does not enforce any smoothness on the detected patterns 
(smoothness is not necessarily expected in this context) and the graph Laplacian is 
only used to ensure that very connected genes do not le ad to spurious det ection. The 
Gene Expression Network Analysis (GXNA) method of iNacu et al.l 2007] detects dif- 
ferentially expressed subgraphs based on a greedy search algorithm and gene set DE 
scoring functions that do not account for the graph structure. 

The rest of this paper is organized as follows: Section [2] introduces elements of 
Fourier analysis for graphs which are needed to develop our method. Section [3] presents 
our graph-structured two-sample test statistic and states results on power gain for 
smooth-shift alternatives. Section H] describes procedures for systematically testing all 
the subgraphs of a large graph. Section presents results for synthetic data as well 
as breast cancer gene expression and KEGG data. Finally, Section summarizes our 
findings and outlines ongoing work. 
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2 Fourier analysis on graphs 



The fundamental idea of harmonic analysis for functions defined on a Euclidean space is 
to build a basis of the function space, such that each basis function varies at a different 
frequency. The basis functions are typically sinusoids. They were originally obtained 
in an attempt to solve the heat equation, as the eigenfunctions of the Laplace operator, 
with corresponding eigenvalues proportional to the frequencies of the sinusoids. Any 
function can then be decomposed on the basis as a linear combination of sinusoids 
of increasing frequency. The set of projections of the function on the basis sinusoids 
gives a dual representation of the function, often referred to as Fourier transform. This 
representation is useful for filtering functions, by removing or shrinking coefficients as- 
sociated with high frequencies, as these are typically expected to reflect noise, and then 
taking the inverse Fourier transform. The resulting filtered function contains the same 
signal in the low frequencies as the original function. A related concept is the Dirichlet 
energy of a function / on an open subspace Q, defined as |V f(x)\ 2 dx where V 
is the gradient operator, a measure of variation that is consistent with the Laplace 
operator. In particular, the Dirichlet energy of the basis functions is proportional to 
their associated frequencies. 

For functions on a Euclidean space, natural notions of smoothness, along with the 
Dirichlet energy and dual representation in the frequency domain by projection on 
a Fourier basis, are therefore classically defined from the Laplace operator and its 
spectral decomposition. Likewise, notions of smoothness for functions on graphs can 
be defined based on the graph Laplacian. Specifically, consider an undirected graph 
Q = (V, £), with |V| = p nodes, adjacency matrix A, and degree matrix D = Diag (Al), 
where 1 is a unit column- vector, Diag(x) is the diagonal matrix with diagonal x for 
any vector x, and Da = di. Let / : M' v ' — > R denote a function that associates a real 
value to each node of the graph Q. The Laplacian matrix of Q is typically defined as 
£, = D — A or C norm = I — D~zAD~z for the normalized version. More generally, 
given any gradient matrix V € R' £ ''' v ', defined on Q and associating to each function on 
the graph its variation on each edge, it is possible to derive a corresponding Laplacian 
matrix following the classical definition of the Laplace operator, C = — divV = V T V, 
where div is th e divergence operator defin ed as the negative of the adjoint operator 



of the gradient [Zhou and Scholkopi 120051 ]. Any desired notion of variation may be 



encoded in a gradient function and thus translated into its associated Dirichlet energy 
7}f T £f, for a function / defined on the graph Q. A common choice of gradient is the 
finite difference operator V/ = (fi — fj) i eV . This definition leads to the unnormal- 

ized Laplacian above. The corresponding energy function is iJ^ijev^ — fi) 2 - Let 
C = UAU T denote the spectral decomposition of the Laplacian, where A is the diag- 
onal matrix of eigenvalues A, and the columns of the matrix U are the corresponding 
eigenvectors Then, by definition, the eigenvectors of C are functions of increasing 
energy, as uj C,Ui = A, for all i — 1, . . . , p. In the remainder of this paper, we denote 
by / = U T f the Fourier coefficients of a function / defined on a graph. 
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If the above two notions of smoothness are not appropriate for a particular appli- 
cation, other gradients, leading to other Laplacian matrices, may be devised to build 
the function basis. For example, introducing weights on the edges of a graph and using 
these weights in the normalized version of the finite differences allows the incorporation 
of prior belief on where a shift in distributions is expected to be smooth. For appli- 
cations like structured gene set differential expression detection, one may use negative 
weights for edges that reflect an expected negative correlation between two variables, 
e.g., a gene i whose expression inhibits the expression of another gene j. In this case, 
a small variation of the shift on the edge between i and j should correspond to a small 
\8i + Sj\. Accordingly, the gradient should be defined as (fi — Sijfj) i . gV , where Sij is 
— 1 for negative interactions and 1 for positive interactions. The eigenvectors of the 
corresponding Laplacian £ S i gn are functions of increasing | X^jev (/* ~~ s ijfj)> an a P~ 
propriate notion of smoothness for the application at hand. A signed Laplacian can 
be recovered from the classical definition £ S i gn = D — A sign , where A, ign is allowed to 
have negative entries. Not e that such a sm oothness function is used as a penalty for 
semi-supervised learning in iGoldbergi [2007 . 

As an example, Figure Q] displays the eigenvectors of the signed Laplacian £ S i gn for 
a simple four-node graph with 
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The first eigenvector, corresponding to the smallest frequency (eigenvalue of zero), can 
be viewed as a "constant" function on the graph, in the sense that its absolute value 
is identical for all the nodes, but nodes connected by an edge with negative weight 
take on values of opposite in sign. By contrast, the last eigenvector, corresponding to 
the highest frequency, is such that nodes connected by positive edges take on values of 
opposite sign and nodes connected by negative edges take on values of the same sign. 



3 Graph-structured two-sample test of means un- 
der smooth-shift alternatives 

For multivariate normal distributions, Hotelling's T 2 -test, a classical test of location 
shift, is known to be uniformly most powerful invariant against global-shift alternatives. 
The test statistic is based on the squared Mahalanobis norm of the sample mean shift 
and is given by T 2 = (ff i — x 2 ) T T'~ 1 (xi — x 2 ), where rii, Xi, and denote, re- 
spectively, the sample sizes, means, and pooled covariance matrix, for random samples 
drawn from two p- dimensional Gaussian distributions, A/"(/ii,£), i — 1,2. Under the 
null hypothesis Ho : /ii = ^2 of equal means, NT 2 follows a (central) F-distribution 
F (p, ni + n 2 — p — 1), where TV = j^^^z^w, ■ m general, NT 2 follows a non-central F- 
distribution F(^^A 2 (5, S);p, n\ + n 2 — p— 1), where the non-centrality parameter is 
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Figure 1: Eigenvectors of the signed Laplacian £ S i gn for a simple four-node graph. The 
corresponding eigenvalues are 0,1,1,4. Nodes are colored according to the value of 
the eigenvector, where green corresponds to high positive values, red to high negative 
values, and black to 0. "T"-shaped edges have negative weights. 

a function of the Mahalanobis norm of the mean shift 5, A 2 (5, E) = 5 T Y>~ 1 5, which we 
refer to as distribution shift. In the remainder of this paper, unless otherwise specified, 
T 2 -statistics are assumed to follow the nominal F-distribution, e.g., for critical value 
and power calculations. 

For any graph- Fourier basis U, direct calculation shows that T 2 = T 2 = ra n ^ — 

x 2 ) T U (u T Y,U^ U 1 \x\ — x 2 ), i.e., the statistic T 2 in the original space and the 

statistic T 2 in the graph-Fourier space are identical. More generally, for k < p, the 
statistic in the original space after filtering out frequencies above k is the same as the 
statistic T| restricted to the first k coefficients in the graph-Fourier space: 

fl = ^j^fa - ^Yu [k] (tfJjIJCfa)" 1 Uj k] {x, - x 2 ) 

= UlU2 (xi - x 2 ) T Ul k U T (ui k U T ±Ul k U T ) + Ul k U T ( Xl - x 2 ), 
n\ + n 2 \ J 

where A + denotes the generalized inverse of a matrix A, the p x k matrix U\k] denotes 
the restriction of U to its first k columns, and l k is a p x p diagonal matrix, with zth 
diagonal element equal to one if i < k and zero otherwise. Note that as retaining the 
first k Fourier components is a non-invertible transformation, this filtering indeed has 
an effect on the test statistic, that is, we have T 2 ^ T 2 in general. As the Mahalanobis 
norm is invariant to linear invertible transformations, using an invertible filtering (such 
as weighting each Fourier component according to its corresponding eigenvalue) would 
have no impact on the test statistic. 

Hotelling's T 2 -test is known to behave poorly in high dimension; the following 
lemma shows that gains in power can be achieved by filtering. Specifically, let 5 = U T 5 
and £ = U T 'EU denote, respectively, the mean shift and covariance matrix in the graph- 
Fourier space. Given k < p, let A 2 , (5, S) = SZ-i (£[*,•]) £[fc] denote the distribution shift 
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restricted to the first k dimensions of S and E, i.e., based on only the first k elements 
of 5, (5i : i < k), and the first k x k diagonal block of E, (aij : i,j < k). Under 
the assumption that the distribution shift is smooth, i.e., lies mostly at the beginning 
of the graph spectrum, so that A 2 ,^, E) is nearly maximal for a small value of k, 
Lemma [I] states that performing Hotelling's test in the graph- Fourier space restricted 
to its first k components yields more power than testing in the full graph-Fourier 
space. Equivalently, the test is more powerful in the original space after filtering than 
in the original unfiltered space. Note that this result holds because retaining the first 
k Fourier components is a non-invertible transformation. 

Lemma 1. For any level a and any 1 < I < p — k, there exists d(a, k,l) > such that 
Al +l (~8, E) - Al(l S) < d(a, k, I) =► p a , k (Al(l E)) > p a , k+l (&l + i(l S)), 

where (5 a)k {A 2 ) is the power of Hotelling's T 2 -test at level a in dimension k for a 
distribution shift A 2 , according to the nominal F- distribution F( n "^ A 2 ; k, 77,1 + 77,2 — 
k-1). 



Proof . This lemma is a direct application of Corollary 2.1 in iDas Gupta and Perlman 
1974 1 to Hotelling's T 2 -test in th e graph- Fourier space. The bottom line of the proof 



of IDas Gupta and Permian! |1974j 's result is that j3 a ^ can be shown to be a continuous 
and strictly decreasing function of k, so that a strictly positive increase in the non- 
centrality parameter A 2 of the F-distribution is necessary to maintain power when 
increasing dimension. □ 

In particular, a direct application of Lemma [1] yields the following Corollary: 

Corollary 1. J/V 1<1 <p-k, A 2 (5, E) = A 2 k+l (5, E), then 

(3 aik (A 2 k CS,t)) > (3 aMl (At +l (S,t)). 

According to Corollary [TJ if the distribution shift lies in the first k Fourier coeffi- 
cients, then testing in this subspace yields strictly more power than using additional 
coefficients. In particular, if there exists k < p such that 5j = V j > k (i.e., the 
mean shift is smooth) and E is block-diagonal such that cr, j = V i < k,j > k, then 
gains in power are obtained by testing in the first k Fourier components. Although 
non-necessary, this condition is plausible when the mean shift lies at the beginning of 
the spectrum, as the coefficients which do not contain the shift are not expected to be 
correlated with the ones that do contain it. 

Note that the result in Lemma [T] is even more general, as testing in the first k 
Fourier components can increase power even when the distribution shift partially lies 
in the remaining components, as long as the latter portion is below a certain threshold. 
Figure [2] illustrates, under different settings, the increase in distribution shift necessary 
to maintain a given power level against the number of added coefficients. 

If for some reason one expects that the mean shift S is smooth (rather than the dis- 
tribution shift A), i.e., 6 lies at the beginning of the spectrum, and that the covariance 
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between coefficients that contain the shift and those that do not is non-zero, then one 
should use test statis tics based on estimators of t he unstandardized Eucl idean norm \\5\ \ 
of th is shift, e.g., Z Bai and Saranadasa . 1996| [Equation (4.5)] or T n Chen and Qinl . 
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Results similar to Lemma [T] can be derived for these statistics. Namely, the cor- 
responding tests gain asymptotic power when applied at the beginning of the spectrum, 
provided the Euclidean norm of 5 only incre ases moderately as coefficie nts for higher 
freq uencies are added. Th e results follow from lBai and Saranadasa! 19961 ] [Theorem 4.1] 
and lChen and Qinl |2010| [Equations (3.11)— (3.12)], using the fact that, by Cauchy's in- 
terlacing theorem, the trace of the square of any positive semi-definite matrix is larger 
than the trace of the square of any principal submatrix. 
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Figure 2: Left: Increase in distribution shift required for Hotelling's T 2 -test to maintain 
a given power when increasing the number of tested Fourier coefficients: A 2 ^ — A 2 , 
vs. I such that /3 Qj fc + /(A 2 +/; ) = /^(A 2 .). Power (3 a ^+i(Al +l ) computed under the non- 
central F-distribution F(^^Al +l ] k + Z,ni + n 2 - (k + I) - 1), for m = n 2 = 100 
observations, k — 5, and a = 10 -2 . Each line corresponds to the fixed shift A 2 , and 
power /3 a fc(A^) pair indicated in the legend. Right: Zoom on the first 30 dimensions. 



4 Non-homogeneous subgraph discovery 

A systematic approach for discovering non-homogeneous subgraphs, i.e., subgraphs of 
a large graph that exhibit a significant shift in means, is to test all of them. In practice, 
however, this can represent an intractable number of tests, so it is important to be able 
to rapidly identify sets of subgraphs that all satisfy the null hypothesis of equal means. 
To this end, we devise a pruning approach based on an upper bound on the value of 
the test statistic for any subgraph containing a given set of nodes. 
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4.1 Exact algorithm 

Given a large graph Q with p nodes, we adopt the following classical branch-and-bound- 
like approach to test subgraphs of size q < p at level a. We start by checking, for each 
node in Q, whether the Hotelling T 2 -statistic in the first k graph-Fourier components of 
any subgraph of size q containing this node can be guaranteed to be below the level-a 
critical value T^ k (e.g., (1 — a)-quantile of F (k, ri\ + n 2 — k — 1) distribution). If this 
is the case, the node is removed from the graph. We then repeat the procedure on the 
edges of the remaining graph and, iteratively, on the subgraphs up to size q — 1, at 
which point we test all the remaining subgraphs of size q. 

Specifically, for a subgraph g of Q of size q < p, Hotelling's T 2 -statistic in the first 
k < q graph-Fourier components of g is defined as 

f ^9) = ^^<M9) - X2{g)YU [k] (u^ k] t{g)U [k ^ 1 UlpM - x 2 (g)), 

where U[k] is the q x k restriction of the matrix of q eigenvectors of the Laplacian of g to 
its first k columns (i.e., U[k]{g), where we omit g to ease notation) and Xi(g), i = 1,2, 
and S((?) are, respectively, the empirical means and pooled covariance matrix restricted 
to the nodes in g. We make use of the following upper bound on T^(g). 

Lemma 2 (Upper bound on T^). For any subgraph g of Q of size q < p, any subgraph 
g' of g of size q' < q, and any k < q, then 

ft{g)<T\v{g',q-q')), 

where u(g',r) is the r -neighborhood of g' , that is, the union of the nodes of g' and the 
nodes whose shortest path to a node of g' is less than or equal to r. 

The proof involves the following result: 

Lemma 3 (Bessel inequality for Mahalanobis norm). Let S G M p,p be an invertible 
matrix and P G M. p,k , k < p, be a matrix with orthonormal columns. For any x G W , 

x t e~ i x > x T p (p T Epy 1 p t x. 

Proof. First note that, by orthonormality of the columns of P, P T SP is indeed invert- 
ible, and that 

ir 1 - p (p t sp) _1 p T = £-5 h - (> T £5£5p) 1 P T S i^ s -| ; 

where Eip (P T £^p) P T sl is an orthogonal projection, with eigenvalues either 

or 1. Thus, I—YAP ^P T S^S^pj P T £^ is positive -semi-definite, as its eigenvalues 

are also either or 1. The result follows from properties of products of positive-semi- 
definite matrices. □ 
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We can now prove Lemma [2j 



Proof. By Lemma [31 



ni+n 2 

( Xl (g) - x 2 (g)) T ±{gy x {x x {g) - x 2 (g)) = T 2 (g). 



ni + n 2 



As g C v(g' ,q — q'), applying Lemma [3] a second time with the compression from 
y {S ' t Q ~ to the nodes of g yields the result. □ 

Note that the bound takes into account the fact that the T 2 -statistic is eventually 
computed in the first few components of a basis which is not known beforehand : at 
each step, for each potential subgraph g' which would include the subgraph g which we 
consider for pruning, the T^(g') that we need to upper bound depends on the graph 
Laplacian of g' . 



4.2 Mean-shift approximation 

For "small-world" graphs above a certain level of connectivity and q large enough, the 
(q — s)-neighborhood of g', v(g', q — s), tends to be large, at least at the beginning of 
the above exact algorithm, and the number of tests actually performed won't decrease 
much compared to the total number of possible tests. One can, however, identify much 
more efficiently the subgraphs whose sample mean shift in the first k components of 

the graph-Fourier space has Euclidean norm ||5[fc](<7)|| = \\U^(xi(g) — x 2 (g))\\ above a 
certain threshold. Indeed, it is straightforward to see that 

WU^iMg) - x 2 (g))\\ 2 < r T (*i(<?) - x 2 (g))\\ 2 

= \\xi(g) - x 2 (g)\\ 2 
< \\ Xl {g') - x 2 {g')\\ 2 
+ max \\x 1 (v 1 ,...,v g _ s )-x 2 (v 1 ,...,Vg_ s )\\ 2 . 

vx,...,v q - a &v(g' ,q— s) 

This inequality can then be used in the procedure of Section I4.1[ to identify all sub- 
graphs for which the Euclidean norm of the sample mean shift exceeds a given thresh- 
old: ||<5[fc](g) || 2 > 0. For any a, if this threshold 9 is low enough, all the subgraphs 
with T%(g) > T^ k are included in this set. Performing the actual T 2 -test on these pre- 
selected subgraphs yields exactly the set of subgraphs that would have been identified 
using the exact procedure of Section H~T1 More precisely, we have the following result: 

Lemma 4. For any threshold 9 > 0, k < q < p, and any subgraph g of size q such that 

\\l [k] (g)\\ 2 <°, 



Nn x n 2 9 



NT k \g) > Ti M =► X min (E [k] (g)) < ^ - ^ 



2 ' 
a,k 
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where T^ k is the level-a critical value for T k (e.g., (1 — a)-quantile of Fo(k,ni + 
n 2 — k — 1)), N = ™^+n2-2)l an d ^mm(^[fc](flO) denotes the smallest eigenvalue of 
= U [k] t(g)U^ k] . 

Proof. As I — (£[fc](g)) _1 A m j n (£[fc]((7)) >z 0, it follows that, for any x, 



\x\\ 2 



A m m(£[fc](sO) 

□ 

Lemma H] states that for any subgraph which would be detected by Hotelling's 

T 2 -statistic T k (g) but not by the Euclidean criterion ||<5[fc]((7) || 2 , the sample covariance 
matrix in the restricted graph- Fourier space has an eigenvalue below a certain threshold. 
This implies that such false negative subgraphs (from the Euclidean approximation to 
the exact algorithm) always have a small mean shift in the graph- Fourier space, but 
in a direction of small variance. In context of gene expression, this is related to the 
well-known issue of the detection of DE genes by virtue of their small variances. Even 
though the differences in expression appear to be highly significant for these genes, 
they correspond to small effects that are not interesting from a practical point of view 
(i.e., biologically insignificant). Methods for addressing this problem are proposed 

Lonnstedt and Speed [200 1| . Note that A m j n (S(^))) < A m j n (£[fc] (#))); thus, the 



in 



remark on variances holds for both the graph-Fourier and original spaces. However, if 
q is large, we expect A m j n (£(<?)) to be very small, while filtering somehow controls the 
conditioning of the covariance matrix. 



4.3 Multiple testing 

Testing for homogeneity over the potentially large number of subgraphs investigated as 
part of the above algorithms immediately raises the issue of multiple testing. However, 
the present multiplicity problem is unusual, in the sense that one does not know in 
advance the total number of tests and which tests will be performed specifically. S tan- 



arc 



dard multiple testing procedures, such as those in lDudoit and van der Laanl [2008 
therefore not immediately applicable. 

In an attempt to address the multiplicity issue, we apply a permutation proce- 
dure to control the number of false positive subgraphs under the complete null hy- 
pothesis of identical distributions in the two populations. Specifically, one permutes 
the class/population labels (1 or 2) of the n\ + n 2 observations and applies the non- 
homogeneous subgraph discovery algorithm to the permuted data to yield a certain 
number of false positive subgraphs. Repeating this procedure a sufficiently large num- 
ber of times produces an estimate of the distribution of the number of Type I errors 
under the complete null hypothesis of identical distributions. 
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5 Results 



We evaluate the empirical behavior of the procedures proposed in Sections [3] and HJ 
first on synthetic data, then on breast cancer microarray data analyzed in context of 
KEGG pathways. 



5.1 Synthetic data 

The performance of the graph-structured test is assessed in cases where the distribution 
shift A 2 satisfies the smoothness assumptions described in Section [31 We first generate 
a connected random graph with p = 20 nodes. Next, we generate 10, 000 datasets, each 
comprising ri\ = n 2 = 20 Gaussian random vectors in MP, with null mean shift 5 for 
5,000 datasets and mean shift 5 = 1 for the remaining 5,000. For the latter datasets, 
the non-zero shift is built in the first ko = 3 Fourier coefficients (the shift being zero 
for the remaining p — ko coefficients) and an inverse Fourier transformation is applied 
to random vectors generated in the graph-Fourier space. We consider two covariance 
settings: in the first one, the covariance matrix in the graph-Fourier space is diagonal 
with diagonal elements at In the second one, correlation is introduced between 

the shifted coefficients only. Specifically, for i,j < k Q , Sjj = ^| if % ^ j, E 4i = ^= 
otherwise. 

Figure |3] displays receiver operator characteristic (ROC) curves for mean shift de- 
tection by the standard Hotelling T 2 -test, T 2 in the first ko Fourier co efficients, T 2 



in the first ko principal components (PC), the adaptive Neyman test of iFan and Lin 



19981 ] . and a modified version of this test where the correct value of kp i s specified. Note 



that we do not consider sparse learning approaches [Jacob et al.l . 12009k iJenatton et al. 



2009], but it would be straightforward to design a realistic setting where such ap- 
proaches are outperformed by testing, e.g., by adding correlation between some of the 
functions under Hi. 

The first important comparison is between the classical Hotelling T 2 -test versus 
the T 2 -test in the graph-Fourier space. As expected from Lemma [TJ testing in the 
restricted space where the shift lies performs much better than testing in the full space 
which includes irrelevant coefficients. The difference can be made arbitrarily large by 
increasing the dimension p and keeping the shift unchanged. The graph-structured test 
retains a large advantage even for moderately smooth shifts, e.g., when ko = 3 and 
p = 5. Of course, this corresponds to the optimistic case where the number of shifted 
coefficients ko is known. Figure H] shows the power of the test in the graph- Fourier space 
for various choices of k. Even when missing some coefficients (k < k ) or adding a few 
non-relevant ones (k > k ), the power of the graph-structured test is higher than that of 
the T 2 -test in the full space. The principal component a pproach is shown becaus e it was 



proposed for the application which motivated our work [Ma and Kosoroki . 120091 ] and as 
it illustrates that the performance improvement originates not only from dimensionality 
reduction, but also from the fact that this reduction is in a direction that does not 
decrease the shift. We emphasize that power entirely depends on the nature of the 
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Figure 3: ROC curves for the detection of a smooth shift using various test statistics. 
Left: Diagonal covariance structure. Right: Block-diagonal covariance structure. 



shift and that a PC-based test would outperform our Fourier-based test when the shift 
lie s in the first principal com pon ents rather than Four ier coefficients. The statistics 
of iBai and Saranadasal 19961 ] and IChen and Qml 20101 ] are also largely outperformed 
by our graph-structured statistic (ROC curves not shown in Figure [3] for the sake 
of readability), which illustrates that working in the graph- Fourier space solves the 
problem of high-dimensionality for which these statistics were designed. Here again, 
for a non-smooth shift, the comparison would be less favorable. Finally, we consider 
the adaptive Neyman test of iFan and Lin! 19981 ]. which takes advantage of smoothness 
assumptions for time-series. This test differs from our graph-structured test, as Fourier 
coefficients for stationary time-series are known to be asymptotically independent and 
Gaussian. For graphs, the asymptotics would be in the number of nodes, which is 
typically small, and necessary conditions such as stationarity are more difficult to define 
and unlikely to hold for data like gene expression mea surements. In the uncorrelated 
setting, the modified version of the lFan and Linl 1998| statistic based the true number 
of non-zero coefficients performs approximately as well as the graph-structured T 2 . 
However, for correlated data, it loses power and both versions can have arbitrarily 
degraded performance. This, together with the n eed to use the boot strap to calibrate 
this test, illustrates that direct transposition of the lFan and Linl 19981 ] test to the graph 
context is not optimal. 

To evaluate the performance of the subgraph discovery algorithms proposed in 
Section HI we generated a graph of 100 nodes formed by tightly-connected hubs of sizes 
sampled from a Poisson distribution with parameter 10 and only weak connections 
between these hubs (Figure [5]). Such a graph structure mimics the typical topology of 
gene regulation networks. We randomly selected one subgraph of 5 nodes to be non- 
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Figure 4: Power of the T 2 -test in the graph-Fourier space with an actual mean shift 
evenly distributed among the first k = 5 coefficients. 

homogeneous, with smooth shift in the first A; = 3 Fourier coefficients. The mean shift 
was set to zero on the rest of the graph. We set the norm of the mean shift to 1 and 
the covariance matrix to identity, so that detecting the shifted subgraph is impossible 
by just looking at the mean shift on the graph. 

We evaluated run-time for full enumeration, the exact branch-and-bound algorithm 
based on Lemmal2l (Section l4. ip . and the approximate algorithm based on the Euclidean 
norm (Section 14.21) . We also examined run-time on data with permuted class labels, 
as the subgraph discovery procedure is to be run on such data to evaluate the number 
of false positives and adjust for multiple testing. Averaging over 20 runs, the full 
enumeration procedure took 732 ± 9 seconds per run and the exact branch-and-bound 
627 ± 59 seconds on the non-permuted data and 578 ± 100 seconds on permuted data. 
Over 100 runs, the approximation at 9 — 0.5 (A m j n = 0.52) took 204 ± 86 seconds 
(129 ± 91 on permuted data) and the approximation at 9 = 1 (X m in — 1-04) took 
183 ± 106 seconds (40 ± 60 on permuted data). The latter approximation missed the 
non-homogeneous subgraph in 5% of the runs. 

While neither the exact nor the approximate bounds are efficient enough to allow 
systematic testing on huge graphs for which the exact approach would be impossible, 
they allow a significant gain in speed, especially for permuted data, and will thus prove 
to be very useful for multiple testing adjustment. 
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Figure 5: Random graph used in the evaluation of the pruning procedure. 



5.2 Breast cancer expression data 



We also validated our methods using the microarray dataset of lLoi et al.l [20081 ] . which 
comprises expression measures for 15, 737 genes in 255 patients treated with tamoxifen. 
Using distant metastasis free survival as a primary endpoint, 68 patients are labeled as 
resistant to tamoxifen and 187 are labeled as sensitive to tamoxifen. Our goal was to 
detect structured groups of genes which are differentially expressed between resistant 
and sensitive patients. 

We first tested individually 323 connected components from 89 KEGG pathways 
corresponding to known gene regulation networks, using the classical Hotelling T 2 -test 
and the T 2 -test in the graph-Fourier space retaining only the first 20% Fourier coef- 
ficients (k = 0.2p). For each of the 323 graphs, (unadjusted) p- values were computed 
under the nominal F-distributions Fq(p, rt\ + ri2 — p — 1) and F^(k, n\ + ri2 — k — 1), re- 
spectively. Figure [6] shows the pathway for which the ratio of graph- Fourier to full space 
p- values is the lowest (i.e., most significant for graph-structured test relative to classical 
test) and the pathway for which it is the highest. As expected, the former corresponds 
to a shift which appears to be coherent with the network (even on edges corresponding 
to inhibition), while the latter is a small network with non- smooth shift. More gener- 
ally, the classical approach tends to select very small networks. The coherent pathway 
selected by our graph-structured test corresponds to Leukocyte trans endothelial mi- 
gration. To the best of our knowledge, this pathway is not specifically known to be 



16 



involved in tamoxifen resistance. However, its role in resistance is plausib le, as leuko 



cyte infiltration was recently found to be involved in breast tumor invasion [Man! . 12010 



more generally, the immune system and inflammatory response are closely-related to 
the evolution of cancer. 

We then ran our branch-and-bound non-homogeneous subgraph discovery proce- 
dure on the cell cycle pathway, which, after restriction to edges of known sign (inhi- 
bition or activation), has 86 nodes and 442 edges. Specifically, we sought to detect 
differentially expressed subgraphs of size q — 5, after pre-selecting those for which the 
squared Euclidean norm of the empirical shift exceeded 9 = 0.1; for a test in the first 
k = 3 Fourier components at level a = 10~ 4 , this corresponded to \ m i n < 0.23 and 
to an expected removal of 95% of the subgraphs under the approximation that the 
squared Euclidean norm of the subgraphs follows a Xs-distribution. 

For a = 10~ 4 , none of the 50 runs on permuted data gave any positive subgraph and 
31 overlapping subgraphs (Figure [7]) were detected on the original data, corresponding 
to a connected subnetwork of 22 genes. Some of these genes have large individual 
differential expression, namely TP53 whose mu t ation has been long-known to be in - 



volved in tamoxifen resistance Andersson et al.. 2005. Fernandez-Cuesta et al.. 2010 



E2F1, whose expre ssion level was recently shown to be involved in tamoxifen re sistance 



Louie et al. . 2010|. is also part of the identified network, as well as CCND1 Barnes 



19971 . iMusgrove and Sutherland! 12009 



Some other genes in the network have quite 
low t-statistics and would not have been detected individually . This is the case of 



CCNE1 and CDK2, which were also described in Louie et all |2010| as part of the 
same mechanism as E2F1. Similarly, CDKN1A was recently found t o be involved in 
anti-cestrogene treatment resistance Musgrove and Sutherland!. 120091 and in ovarian 



cancer which is also a hormone-dependent cancer [Cunningham et al.l . l2009j . The net- 



works also contains RBI, a tumor suppressor whose expression or l oss is known to be 
correlated to tamoxifen resistance Musgrove and Sutherland!. 120 09 



RBI is inhibited 

as 



2009 



by CDK4, whose inhibition has been described in Sutherland and Musgrove 
acting synergistically with tamoxifen or trastuzumab. More generally, a large part 



of the network displayed on Figure 2 A of IMusgrove and Sutherland! [2009| is included 
in our network, along with other known actors of tamoxifen resistance. Our system- 
based approach to pathway discovery therefore directly identifies a set of interacting 
important genes and may therefore prove to be more efficient than iterative individual 
identification of single actors. 



6 Discussion 

We developed a graph-structured two-sample test of means, for problems in which the 
distribution shift is assumed to be smooth on a given graph. We proved quantitative re- 
sults on power gains for such smooth-shift alternatives and devised branch-and-bound 
algorithms to systematically apply our test to all the subgraphs of a large graph. The 
first algorithm is exact and reduces the number of explicitly tested subgraphs. The sec- 
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ond is approximate, with no false positives and a quantitative result on the type of false 
negatives (with respect to the exact algorithm). The non- homogeneous subgraph dis- 
covery method involves performing a larger number of tests, with highly-dependent test 
statistics. However, as the actual number of tested hypotheses is unknown, standard 
multiple testing procedures are not directly applicable. Instead, we use a permutation 
procedure to estimate the distribution of the number of false positive subgraphs. Such 
resampling procedures (bootstrap or permutation) are feasible due to the manageable 
run-time of the pruning algorithms of Section SJ Results on synthetic data illustrate 
the good power properties of our graph-structured test under smooth-shift alternatives, 
as well as the good performance of our branch-and-bound-like algorithms for subgraph 
discovery. Very promising results are also obtained on the drug resistance microarray 



dataset of Loi et al. 2008 



Future work should investigate the use of other bases, such as graph- wavelets [Hammond et al. 



20091 ] . which would allow the detection of shifts with spatially- located non-smoothness, 
for example, to take into account errors in existing networks. More systematic pro- 
ce dures for cutoff selection should also be considered, e.g., two-st ep method proposed 
in iDas Gupta and Permian! 19741 ] or adaptive approaches as in iFan and Lin! 1998 



The pruning algorithm would naturally benefit from sharper bounds. Such bounds 
could be obtained by controlling the condition number of all covariance matrices, us- 
ing, for example, regul arized statistics which still have known non-asymptotic distribu- 



tions, such as those of iTai and Speed! [20081 ] . Concerning multiple testing, procedures 



should be devised to exploit the dependence structure between the tested subgraphs 
and to deal with the unknown number of tests. The proposed approach could also be 
enriched to take into account different types of data, e.g., copy number for the detec- 
tion of DE gene pathways. More s ubtle notions of smoothness, e.g., "and" and "or" 



logical relations [Vaske et al.l . |2010| . could also be included. An interesting alternative 
application would be to explore the list of pathways which are known to be differen- 
tially expressed (or detected by the classical T 2 -test), but which are not detected by 
the graph-Fourier approach, to infer possible mis-annotation in the network. Other 
applications of two-sample tests with smooth-shift on a graph include fMRI and eQTL 
association studies. 

Finally, it would be of interest to compare our testing approach with structured 
sparse learning, for the purpose of identifying expression signatures that are predictive 
of drug resistance. Methods should be compared in terms of prediction accuracy and 
stability of the selected genes acr oss different datasets , a central and difficult problem 



in th e design of such signatures [Ein-Dor et al.l . 120051 iHe and Yul . 120101 lHaury et al. 



20101 ] . The comparison should also take into account the merits of the sparsity-inducing 
norm over the hypothesis testing-based selection, as well as the influence of the smooth- 
ness assumption. The lat ter could indeed a lso be integrated in a sparsity-inducing 
penalty by applying, e.g., IJacob et al.l [20091 ] to the reduced graph-Fouri er represen- 



tatio n of the pathways, yielding a special case of multiple kernel learning [Bach et al. 
2004 . 
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Figure 6: Difference in sample mean expression measures between tamoxifen-resistant 
and sensitive patients, for genes in two KEGG regulation networks. Top: Regulation 
network (Leukocyte transendothelial migration) with the lowest ratio of graph- Fourier 
to full space p-values. Bottom: Regulation network (Alzheimer's disease) with the 
highest ratio of graph-Fourier to full space p-values. Nodes are colored according to 
the value of the difference in means, with green corresponding to high positive values, 
red to high negative values, and black to 023led arrows denote activations, blue arrows 
inhibition. 
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Figure 7: Difference in sample mean expression measures between tamoxifen-resistant 
and sensitive patients, for genes in the two overlapping subgraphs detected at a = 
10 -4 . Nodes are colored according to the value of the difference in means, with green 
corresponding to high positive values, red to high negative values, and black to 0. Red 
arrows denote activations, blue arrows inhibition. 
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